Analysis and correction of bias in Total Decrease in Node Impurity measures for tree-based algorithms

Authors

  • Marco Sandri
  • Paola Zuccolotto

Corresponding author: Paola Zuccolotto, Quantitative Methods Department, University of Brescia, C.da Santa Chiara 50, 25122 Brescia, Italy. Email: [email protected]

Abstract

Variable selection is one of the main problems faced by data mining and machine learning techniques. For the most part, these techniques are more or less explicitly based on some measure of variable importance. This paper considers Total Decrease in Node Impurity (TDNI) measures, a popular class of variable importance measures defined in the field of decision trees and tree-based ensemble methods, such as Random Forests and Gradient Boosting Machines. Despite their wide use, some measures of this class are known to be biased, and some correction strategies have been proposed. The aim of this paper is twofold: first, to investigate the source and the characteristics of bias in TDNI measures using the notions of informative and uninformative splits; second, to extend a bias-correction algorithm, recently proposed for the Gini measure in the context of classification, to the entire class of TDNI measures and to investigate its performance in the regression framework using simulated and real data.
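As a rough illustration of the quantity the abstract refers to, the sketch below grows a single greedy regression tree and adds up, for each variable, the decrease in node impurity produced by the splits made on it. It is not the authors' implementation and does not reproduce their bias-correction algorithm. The function names (`sse`, `best_split`, `tdni`), the use of the sum of squared errors as the impurity, and the simulated data are all assumptions made for this example.

```python
import random
from statistics import fmean


def sse(y):
    """Node impurity for regression: sum of squared deviations from the node mean."""
    if not y:
        return 0.0
    mean = fmean(y)
    return sum((v - mean) ** 2 for v in y)


def best_split(x, y):
    """Best threshold on one variable and the impurity decrease it achieves."""
    best_t, best_dec = None, 0.0
    values = sorted(set(x))
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        dec = sse(y) - sse(left) - sse(right)
        if dec > best_dec:
            best_t, best_dec = t, dec
    return best_t, best_dec


def tdni(X, y, max_depth=3):
    """Grow one greedy tree; return the total impurity decrease per variable."""
    importance = {name: 0.0 for name in X}

    def grow(rows, depth):
        if depth == 0 or len(rows) < 2:
            return
        y_node = [y[i] for i in rows]
        # Collect the best split of every variable at this node.
        candidates = []
        for name, col in X.items():
            t, dec = best_split([col[i] for i in rows], y_node)
            if t is not None:
                candidates.append((dec, name, t))
        if not candidates:
            return
        # Split on the variable with the largest decrease and credit it.
        dec, name, t = max(candidates)
        importance[name] += dec
        grow([i for i in rows if X[name][i] <= t], depth - 1)
        grow([i for i in rows if X[name][i] > t], depth - 1)

    grow(list(range(len(y))), max_depth)
    return importance


# Hypothetical data: the response is pure noise, independent of both variables.
random.seed(0)
n = 200
X_demo = {
    "x1": [random.random() for _ in range(n)],
    "x2": [random.random() for _ in range(n)],
}
y_demo = [random.gauss(0, 1) for _ in range(n)]
imp_demo = tdni(X_demo, y_demo)
print(imp_demo)
```

Because a split on the sample never increases the within-node sum of squares, the accumulated importance is non-negative even when, as in the simulated data here, neither variable carries any information about the response. This is the kind of contribution from uninformative splits that the abstract identifies as a source of bias.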

Journal:
  • Statistics and Computing

Volume 20  Issue 

Pages  -

Publication date 2010